Due Jul 5, 2:59 AM EDT
Suppose your training examples are sentences (sequences of words). Which of the following refers to the jth word in the ith training example?
We index into the ith row first to get the ith training example (represented by parentheses), then the jth column to get the jth word (represented by the angle brackets), so the jth word of the ith example is written x(i)<j>.
Consider this RNN:

This specific type of architecture is appropriate when:
It is appropriate when every input should be matched to an output.
To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply).

Correct!
Correct!
You are training this RNN language model.

At the tth time step, what is the RNN doing? Choose the best answer.
Yes, in a language model we try to predict the next step based on the knowledge of all prior steps.
You have finished training a language model RNN and are using it to sample random sentences, as follows:

What are you doing at each time step t?
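A minimal sketch of one common sampling procedure, assuming a plain tanh RNN with a softmax output layer: at each step the RNN's predicted distribution is used to randomly draw a word, and the drawn word is fed in as the next input. The weight names (Waa, Wax, Wya, ba, by), the helper sample_sequence, and the tiny sizes are hypothetical and chosen only for illustration; the weights here are random rather than trained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def sample_sequence(Waa, Wax, Wya, ba, by, max_len, rng, eos_index=None):
    """Sample word indices one step at a time (hypothetical helper)."""
    n_a, vocab = Wax.shape
    a = np.zeros(n_a)                # a<0> = 0
    x = np.zeros(vocab)              # x<1> = 0 (no previous word yet)
    indices = []
    for _ in range(max_len):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        probs = softmax(Wya @ a + by)            # distribution over the next word
        idx = int(rng.choice(vocab, p=probs))    # random draw, not argmax
        indices.append(idx)
        if idx == eos_index:
            break
        x = np.zeros(vocab)          # feed the sampled word to the next step
        x[idx] = 1.0
    return indices

# Toy usage with random (untrained) weights: 5-word vocabulary, 3 hidden units.
rng = np.random.default_rng(0)
vocab, n_a = 5, 3
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, vocab))
Wya = rng.standard_normal((vocab, n_a))
sampled = sample_sequence(Waa, Wax, Wya, np.zeros(n_a), np.zeros(vocab),
                          max_len=10, rng=rng)
```

Picking the most probable word instead of drawing from the distribution would make the output deterministic, which is why sampling uses a random draw.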
You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
Suppose you are training an LSTM. You have a 10000-word vocabulary, and are using an LSTM with 100-dimensional activations a<t>. What is the dimension of Γu at each time step?
Correct, Γu is a vector of dimension equal to the number of hidden units in the LSTM.
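The shape can be checked with a small NumPy sketch. It assumes the usual form of the update gate, Γu = σ(Wu [a<t−1>, x<t>] + bu), with a one-hot input over the vocabulary; the weight values, the chosen word index, and the variable names are illustrative assumptions, not taken from the quiz.

```python
import numpy as np

n_a = 100       # hidden units (from the question)
n_x = 10000     # vocabulary size, one-hot input (from the question)

rng = np.random.default_rng(0)
W_u = rng.standard_normal((n_a, n_a + n_x)) * 0.01   # hypothetical gate weights
b_u = np.zeros(n_a)

a_prev = np.zeros(n_a)          # a<t-1>
x_t = np.zeros(n_x)
x_t[42] = 1.0                   # arbitrary one-hot word

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One gate value per hidden unit, regardless of vocabulary size.
gamma_u = sigmoid(W_u @ np.concatenate([a_prev, x_t]) + b_u)
print(gamma_u.shape)            # (100,)
```

The vocabulary size only affects the number of columns in W_u; the gate itself has one entry per hidden unit.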
Here are the update equations for the GRU.

Alice proposes to simplify the GRU by removing Γu, i.e., always setting Γu = 1. Betty proposes to simplify the GRU by removing Γr, i.e., always setting Γr = 1. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
Yes. For the signal to backpropagate without vanishing, we need c<t> to be highly dependent on c<t−1>.
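To make this concrete, here is a rough numerical sketch, assuming the standard GRU memory update c<t> = Γu * c̃<t> + (1 − Γu) * c<t−1>. It looks only at the direct path from c<t−1> to c<t>, whose per-step factor is (1 − Γu), and ignores the indirect path through c̃<t>; the function names and the gate value 0.001 are illustrative assumptions.

```python
import numpy as np

def gru_memory_update(c_prev, c_tilde, gamma_u):
    """c<t> = gamma_u * c_tilde + (1 - gamma_u) * c_prev (elementwise)."""
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

def direct_path_factor(gamma_u, steps):
    """Product over `steps` time steps of d c<t> / d c<t-1> along the
    (1 - gamma_u) path only (the path through c_tilde is ignored)."""
    return (1.0 - gamma_u) ** steps

# Betty's model keeps a learned gamma_u, which can sit close to 0,
# so the memory cell is carried forward almost unchanged.
betty = direct_path_factor(gamma_u=0.001, steps=100)

# Alice's model fixes gamma_u = 1, so the direct path is cut at every step
# and c<t> depends on c<t-1> only through tanh and the weights.
alice = direct_path_factor(gamma_u=1.0, steps=100)

print(betty, alice)
```

With Γu near 0 the direct-path factor stays close to 1 over 100 steps, whereas fixing Γu = 1 drives it to exactly 0, leaving only the squashed path through c̃<t>.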
Here are the equations for the GRU and the LSTM:

From these, we can see that the Update Gate and Forget Gate in the LSTM play a role similar to _______ and ______ in the GRU. What should go in the blanks?
Yes, correct!
You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence x<1>,…,x<365>. You’ve also collected data on your dog’s mood, which you represent as y<1>,…,y<365>. You’d like to build a model to map from x→y. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
Yes!